About me

I am a computational biologist working in the department of Pathobiology & Diagnostic Investigation (CVM). In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.

I’m also the founder & co-organizer of the R-Ladies East Lansing group on campus. We conduct R-related workshops & meetups regularly! So, do check out our upcoming events on Meetup.

Part 1: Getting Started w/ readr

All material for this workshop is available here, along with other related workshops & useful cheatsheets.

Installation and set-up

Install RStudio

Running RStudio locally? Download RStudio

Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version

Trouble with local installation? Login & start using RStudio Cloud right away!

New to RStudio IDE? Use Help Page #1 & Page #2

Install R

… if you haven’t already! The RStudio startup message shows your current local version of R, e.g., R v3.5.2.

Install packages

  • Install: once
  • Load: each time
install.packages("tidyverse")   # for data wrangling
install.packages("here")        # to set paths relative to your current project.
# install.packages("gapminder") # sample dataset

Explore the RLEL workshops for more examples and sample code for Tidy Data and DataViz [e.g., >> presentations/20181105-workshop-tidydata uses the gapminder & USArrests datasets].

Trouble installing tidyverse?

  • Check your R/RStudio versions
  • For the time being, install the individual packages using install.packages("PACKAGENAME")
  • Check out the whole tidyverse suite of packages here
# If tidyverse installation fails, install individual constituent packages this way...
install.packages("readr")    # Importing data files
install.packages("tidyr")    # Tidy Data
install.packages("dplyr")    # Data manipulation
install.packages("ggplot2")  # Data Visualization (w/ Grammar of Graphics)
install.packages("readxl")   # Importing excel files
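The "install once, load each time" idea can be sketched as a quick check for which packages are still missing before calling install.packages(). This is only an illustration: the `pkgs` vector and the `missing_pkgs` name are assumptions made for this example, not part of the workshop material.

```r
# Packages used in this workshop (illustrative list; adjust as needed)
pkgs <- c("readr", "tidyr", "dplyr", "ggplot2", "readxl", "here")

# requireNamespace() returns TRUE if a package is installed,
# without attaching it to the search path
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing (uncomment to actually install)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
missing_pkgs
```

An empty result means everything is already installed and only the library() calls are needed.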

Loading packages

library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# OR load the individual packages, if you have trouble installing/loading `tidyverse`
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(here) # https://github.com/jennybc/here_here
## here() starts at /Users/jananiravi/GitHub/tidyverse-genomics
# library(gapminder) # useful dataset for data wrangling, visualization

Some useful cheatsheets

Cheatsheets @RStudio | Our cheatsheets repo

More resources towards the end of the document.

Data import

  • read_csv, write_csv
  • read_tsv, write_tsv
  • read_delim, write_delim
  • here::here
here::here() # where you opened your RStudio session/Project
## [1] "/Users/jananiravi/GitHub/tidyverse-genomics"
# Downloading sample genomics dataset from NCBI's FTP site
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE69nnn/GSE69360/suppl/GSE69360_RNAseq.counts.txt.gz"
gse69360 <- read_tsv(url) # to read tab-delimited files
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Geneid = col_character(),
##   Chr = col_character(),
##   Start = col_character(),
##   End = col_character(),
##   Strand = col_character()
## )
## See spec(...) for full column specifications.
?read_tsv # to check defaults
# Comma-separated values, as exported from Excel/spreadsheets
read_csv(file="path/to/my_data.csv", col_names=TRUE)
# Other atypical delimiters
read_delim(file="path/to/my_data.txt", col_names=TRUE, delim="//")

# Other useful packages
# readxl by Jenny Bryan (installed with tidyverse, but must be loaded separately)
library(readxl)
read_excel(path="path/to/excel.xls",
           sheet=1,
           range="A1:D50",
           col_names=TRUE)

## Tip: Always open .Rproj and use relative paths with here()
## Example with here()
read_tsv(file=here("data/GSE69360_RNAseq.counts.txt"), col_names=TRUE)
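The column specification printed by read_tsv() above is guessed from the data. A minimal sketch of declaring it explicitly is shown here on a made-up table written to a temporary file; the `toy` data, its values, and the choice of column types are assumptions for illustration only.

```r
library(readr)
library(tibble)

# Toy table mimicking the layout of the counts file (values are made up)
toy <- tibble(
  Geneid   = c("geneA.1", "geneB.2"),
  Start    = c("100;250", "900"),   # semicolon-joined exon starts stay as text
  Length   = c(350, 120),
  AA_Colon = c(5, 0)
)

tmp <- tempfile(fileext = ".txt")
write_tsv(toy, tmp)

# Declare column types up front instead of relying on guessing
toy_in <- read_tsv(
  tmp,
  col_types = cols(
    Geneid   = col_character(),
    Start    = col_character(),
    Length   = col_double(),
    AA_Colon = col_integer()
  )
)
toy_in
```

Declaring types this way also silences the "Parsed with column specification" message and makes a changed input file fail loudly rather than silently change type.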

Knowing your data

Dataset details:

  • AA: Agilent Adult; AF: Agilent Fetus
  • BA: BioChain Adult; BF: BioChain Fetus
  • OA: OriGene Adult
  • Tissues: Colon, Heart, Kidney, Liver, Lung, Stomach
str(gse69360)              # Structure of the dataframe
gse69360                   # Data is in a cleaned-up 'tibble' format by default
head(gse69360)             # Shows the top few observations (rows) of your data frame
glimpse(gse69360)          # Info-dense summary of the data
View(head(gse69360, 100))  # View data in a visual GUI-based spreadsheet-like format

colnames(gse69360)         # Column names
nrow(gse69360)             # No. of rows
ncol(gse69360)             # No. of columns

gse69360[1:5,7:10]         # Subsetting a dataframe

## saving the data file
write_tsv(gse69360[1:100,7:12], "gse_subset.txt")
library(knitr)
kable(head(gse69360))
Geneid Chr Start End Strand Length AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung AA_Stomach AF_Colon AF_Stomach BA_Colon BA_Heart BA_Kidney BA_Liver BA_Lung BA_Stomach BF_Colon BF_Stomach OA_Stomach1 OA_Stomach2 OA_Stomach3
ENSG00000223972.4 chr1;chr1;chr1;chr1 11869;12595;12975;13221 12227;12721;13052;14412 +;+;+;+ 1756 1 0 0 0 0 0 7 2 0 0 0 0 7 0 0 3 3 1 0
ENSG00000227232.4 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 14363;14970;15796;16607;16854;17233;17498;17602;17915;18268;24734;29321;29534 14829;15038;15947;16765;17055;17368;17504;17742;18061;18379;24891;29370;29806 -;-;-;-;-;-;-;-;-;-;-;-;- 2073 36 21 32 28 17 17 102 41 10 24 22 6 49 2 153 157 38 53 18
ENSG00000243485.2 chr1;chr1;chr1 29554;30267;30976 30039;30667;31109 +;+;+ 1021 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000237613.2 chr1;chr1;chr1 34554;35245;35721 35174;35481;36081 -;-;- 1219 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020.2 chr1;chr1 52473;54830 53312;54936 +;+ 947 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
ENSG00000240361.1 chr1 62948 63887 + 940 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
library(rmarkdown)
paged_table(gse69360)

Part 2: Reshaping data w/ tidyr

gather()   # Gather COLUMNS -> ROWS
spread()   # Spread ROWS -> COLUMNS
separate() # Separate 1 COLUMN -> many COLUMNS
unite()    # Unite several COLUMNS -> 1 COLUMN

Gather, Spread

  • gather: Gather columns into key-value pairs. Wide -> Long
  • spread: Spread a key-value pair across multiple columns: Long -> Wide
# Gather all columns except 'Geneid'
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  gather(-Geneid, key="Sample", value="Counts") # wide -> long format
## # A tibble: 1,098,580 x 3
##    Geneid            Sample   Counts
##    <chr>             <chr>     <dbl>
##  1 ENSG00000223972.4 AA_Colon      1
##  2 ENSG00000227232.4 AA_Colon     36
##  3 ENSG00000243485.2 AA_Colon      0
##  4 ENSG00000237613.2 AA_Colon      0
##  5 ENSG00000268020.2 AA_Colon      0
##  6 ENSG00000240361.1 AA_Colon      0
##  7 ENSG00000186092.4 AA_Colon      0
##  8 ENSG00000238009.2 AA_Colon      3
##  9 ENSG00000239945.1 AA_Colon      0
## 10 ENSG00000233750.3 AA_Colon      0
## # … with 1,098,570 more rows
# Gather, then Spread --> Back to original data
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  gather(-Geneid, key="Sample", value="Counts") %>%
  spread(key = "Sample", value = "Counts")     # spread to turn back to the original data!
## # A tibble: 57,820 x 20
##    Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
##    <chr>     <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1 ENSG0…      735       90       504      913      65
##  2 ENSG0…       16        2        11        6       0
##  3 ENSG0…      474      149       161      316     139
##  4 ENSG0…      329       83       142      370      73
##  5 ENSG0…      191       20        33      153      25
##  6 ENSG0…       93      229        78      232     472
##  7 ENSG0…     1947      815       270    47964     863
##  8 ENSG0…      416      283       755      909     273
##  9 ENSG0…      556      145      1466     1553     202
## 10 ENSG0…      321      107       128      115     156
## # … with 57,810 more rows, and 14 more variables:
## #   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## #   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## #   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## #   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## #   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
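The gather/spread round trip is easier to follow on a table small enough to read in full. The sketch below uses a made-up two-gene table and also shows the equivalent pivot_longer()/pivot_wider() calls available from tidyr 1.0.0 onwards; the object names are illustrative.

```r
library(tidyr)
library(tibble)

wide <- tibble(
  Geneid   = c("geneA", "geneB"),
  AA_Colon = c(1, 36),
  AA_Heart = c(0, 21)
)

# gather(): wide -> long, one row per gene-sample pair
long <- gather(wide, key = "Sample", value = "Counts", -Geneid)

# spread(): long -> wide, one column per sample again
wide_again <- spread(long, key = "Sample", value = "Counts")

# Equivalent calls with the newer pivot_*() verbs (tidyr >= 1.0.0)
long2 <- pivot_longer(wide, cols = -Geneid,
                      names_to = "Sample", values_to = "Counts")
wide2 <- pivot_wider(long2, names_from = Sample, values_from = Counts)
```

Both routes return the original two-row, three-column table. Note that spread() sorts rows by the remaining columns, which is why the full dataset above comes back ordered by Geneid.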

Unite, Separate

  • unite: Unite multiple columns into one
  • separate: Separate one column into multiple columns
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%               # selecting only Counts columns
  gather(-Geneid, key="Sample", value="Counts") %>%  # wide -> long tidying up
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") # separate logically
## # A tibble: 1,098,580 x 4
##    Geneid            Source_Stage Tissue Counts
##    <chr>             <chr>        <chr>   <dbl>
##  1 ENSG00000223972.4 AA           Colon       1
##  2 ENSG00000227232.4 AA           Colon      36
##  3 ENSG00000243485.2 AA           Colon       0
##  4 ENSG00000237613.2 AA           Colon       0
##  5 ENSG00000268020.2 AA           Colon       0
##  6 ENSG00000240361.1 AA           Colon       0
##  7 ENSG00000186092.4 AA           Colon       0
##  8 ENSG00000238009.2 AA           Colon       3
##  9 ENSG00000239945.1 AA           Colon       0
## 10 ENSG00000233750.3 AA           Colon       0
## # … with 1,098,570 more rows
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  gather(-Geneid, key="Sample", value="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) # separate by char position
## # A tibble: 1,098,580 x 5
##    Geneid            Source Stage Tissue Counts
##    <chr>             <chr>  <chr> <chr>   <dbl>
##  1 ENSG00000223972.4 A      A     Colon       1
##  2 ENSG00000227232.4 A      A     Colon      36
##  3 ENSG00000243485.2 A      A     Colon       0
##  4 ENSG00000237613.2 A      A     Colon       0
##  5 ENSG00000268020.2 A      A     Colon       0
##  6 ENSG00000240361.1 A      A     Colon       0
##  7 ENSG00000186092.4 A      A     Colon       0
##  8 ENSG00000238009.2 A      A     Colon       3
##  9 ENSG00000239945.1 A      A     Colon       0
## 10 ENSG00000233750.3 A      A     Colon       0
## # … with 1,098,570 more rows
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  gather(-Geneid, key="Sample", value="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) %>%
  unite(Stage_Tissue, Stage, Tissue) # combining a different set of columns
## # A tibble: 1,098,580 x 4
##    Geneid            Source Stage_Tissue Counts
##    <chr>             <chr>  <chr>         <dbl>
##  1 ENSG00000223972.4 A      A_Colon           1
##  2 ENSG00000227232.4 A      A_Colon          36
##  3 ENSG00000243485.2 A      A_Colon           0
##  4 ENSG00000237613.2 A      A_Colon           0
##  5 ENSG00000268020.2 A      A_Colon           0
##  6 ENSG00000240361.1 A      A_Colon           0
##  7 ENSG00000186092.4 A      A_Colon           0
##  8 ENSG00000238009.2 A      A_Colon           3
##  9 ENSG00000239945.1 A      A_Colon           0
## 10 ENSG00000233750.3 A      A_Colon           0
## # … with 1,098,570 more rows
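The same separate/unite steps, sketched on two made-up sample labels so that every intermediate value can be checked by eye. The `samples` table and the intermediate object names are assumptions for illustration.

```r
library(tidyr)
library(tibble)

samples <- tibble(Sample = c("AA_Colon", "BF_Stomach"), Counts = c(1, 3))

# Step 1: split on the underscore -> "AA" / "Colon"
step1 <- separate(samples, Sample,
                  into = c("Source_Stage", "Tissue"), sep = "_")

# Step 2: split by position, after the first character -> "A" / "A"
step2 <- separate(step1, Source_Stage,
                  into = c("Source", "Stage"), sep = 1)

# Step 3: paste two columns back together (default separator is "_")
step3 <- unite(step2, "Stage_Tissue", Stage, Tissue)
step3
```

A character `sep` is treated as a regular expression to split on, while a numeric `sep` is a position to cut at.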

Part 3: Data wrangling with dplyr

filter()    # PICK observations by their values | ROWS
select()    # PICK variables by their names | COLUMNS
mutate()    # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange()   # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by()  # GROUP data into rows with the same value of variable (COLUMN)

Filter

  • filter: Return rows with matching conditions
head(gse69360) # Snapshot of the dataframe
## # A tibble: 6 x 25
##   Geneid Chr   Start End   Strand Length AA_Colon AA_Heart
##   <chr>  <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
## 1 ENSG0… chr1… 1186… 1222… +;+;+…   1756        1        0
## 2 ENSG0… chr1… 1436… 1482… -;-;-…   2073       36       21
## 3 ENSG0… chr1… 2955… 3003… +;+;+    1021        0        0
## 4 ENSG0… chr1… 3455… 3517… -;-;-    1219        0        0
## 5 ENSG0… chr1… 5247… 5331… +;+       947        0        0
## 6 ENSG0… chr1  62948 63887 +         940        0        0
## # … with 17 more variables: AA_Kidney <dbl>,
## #   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
## #   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
## #   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
## #   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
## #   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
## #   OA_Stomach3 <dbl>
# Now, filter by condition
filter(gse69360, Length<=50)
## # A tibble: 124 x 25
##    Geneid Chr   Start End   Strand Length AA_Colon AA_Heart
##    <chr>  <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
##  1 ENSG0… chr1… 3829… 3829… -;-        45        1        0
##  2 ENSG0… chr1  9521… 9521… -          41        7        1
##  3 ENSG0… chr1  1477… 1477… +          35        0        1
##  4 ENSG0… chr1… 1545… 1545… +;+        21        0        0
##  5 ENSG0… chr1  2360… 2360… -          50        1        2
##  6 ENSG0… chr2… 1160… 1160… +;+        36        1        0
##  7 ENSG0… chr2  8916… 8916… -          38      183        1
##  8 ENSG0… chr2  8916… 8916… -          37       98        2
##  9 ENSG0… chr2  8916… 8916… -          38       80        0
## 10 ENSG0… chr2  8916… 8916… -          38        0        0
## # … with 114 more rows, and 17 more variables:
## #   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## #   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## #   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## #   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## #   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## #   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Can be rewritten using "Piping" %>%
gse69360 %>%   # Pipe ('then') operator to serially connect operations
  filter(Length <= 50)
## # A tibble: 124 x 25
##    Geneid Chr   Start End   Strand Length AA_Colon AA_Heart
##    <chr>  <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
##  1 ENSG0… chr1… 3829… 3829… -;-        45        1        0
##  2 ENSG0… chr1  9521… 9521… -          41        7        1
##  3 ENSG0… chr1  1477… 1477… +          35        0        1
##  4 ENSG0… chr1… 1545… 1545… +;+        21        0        0
##  5 ENSG0… chr1  2360… 2360… -          50        1        2
##  6 ENSG0… chr2… 1160… 1160… +;+        36        1        0
##  7 ENSG0… chr2  8916… 8916… -          38      183        1
##  8 ENSG0… chr2  8916… 8916… -          37       98        2
##  9 ENSG0… chr2  8916… 8916… -          38       80        0
## 10 ENSG0… chr2  8916… 8916… -          38        0        0
## # … with 114 more rows, and 17 more variables:
## #   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## #   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## #   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## #   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## #   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## #   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Filtering using regex/substring match
gse69360 %>%
  filter(grepl("chrY", Chr))
## # A tibble: 542 x 25
##    Geneid Chr   Start End   Strand Length AA_Colon AA_Heart
##    <chr>  <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
##  1 ENSGR… chrY… 1204… 1205… +;+       259        0        0
##  2 ENSGR… chrY… 1429… 1430… +;+;+…   6191        0        0
##  3 ENSGR… chrY… 1700… 1701… -;-;-…   2363        0        0
##  4 ENSGR… chrY… 2317… 2319… +;+       428        0        0
##  5 ENSGR… chrY… 2446… 2496… -;-;-…   9015        0        0
##  6 ENSGR… chrY  3753… 3754… -         101        0        0
##  7 ENSGR… chrY  4345… 4348… -         328        0        0
##  8 ENSGR… chrY  4559… 4560… -         117        0        0
##  9 ENSGR… chrY… 5350… 5353… +;+;+…   4384        0        0
## 10 ENSGR… chrY… 9009… 9011… +;+       588        0        0
## # … with 532 more rows, and 17 more variables:
## #   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## #   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## #   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## #   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## #   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## #   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Two filters at a time
gse69360 %>%
  filter(Length <= 50 & grepl("chrY", Chr))
## # A tibble: 1 x 25
##   Geneid Chr   Start End   Strand Length AA_Colon AA_Heart
##   <chr>  <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
## 1 ENSG0… chrY… 5306… 5306… -;-        42        1        0
## # … with 17 more variables: AA_Kidney <dbl>,
## #   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
## #   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
## #   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
## #   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
## #   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
## #   OA_Stomach3 <dbl>
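How conditions combine inside filter() can be checked on a four-row toy table. The `genes` data below is made up for illustration.

```r
library(dplyr)
library(tibble)

genes <- tibble(
  Geneid = c("g1", "g2", "g3", "g4"),
  Chr    = c("chr1", "chrY;chrY", "chrY", "chr2"),
  Length = c(45, 42, 300, 1200)
)

# Comma-separated conditions are combined with AND, same as `&`
both_a <- filter(genes, Length <= 50, grepl("chrY", Chr))
both_b <- filter(genes, Length <= 50 & grepl("chrY", Chr))

# `|` keeps rows that satisfy either condition
either <- filter(genes, Length <= 50 | grepl("chrY", Chr))
```

Here only g2 passes both conditions, while g1, g2 and g3 each pass at least one.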

Select

  • select: Select/rename variables/columns by name
# Selecting columns that match a pattern
gse69360 %>%
  select(Geneid, matches(".F_"))
## # A tibble: 57,820 x 5
##    Geneid            AF_Colon AF_Stomach BF_Colon BF_Stomach
##    <chr>                <dbl>      <dbl>    <dbl>      <dbl>
##  1 ENSG00000223972.4        7          2        0          3
##  2 ENSG00000227232.4      102         41      153        157
##  3 ENSG00000243485.2        0          0        0          0
##  4 ENSG00000237613.2        0          0        0          0
##  5 ENSG00000268020.2        0          0        0          0
##  6 ENSG00000240361.1        0          0        0          0
##  7 ENSG00000186092.4        0          0        0          0
##  8 ENSG00000238009.2        4          0        5          4
##  9 ENSG00000239945.1        0          0        0          0
## 10 ENSG00000233750.3        2          3        3          1
## # … with 57,810 more rows
# Excluding specific columns
gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length)
## # A tibble: 57,820 x 20
##    Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
##    <chr>     <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1 ENSG0…        1        0         0        0       0
##  2 ENSG0…       36       21        32       28      17
##  3 ENSG0…        0        0         0        0       0
##  4 ENSG0…        0        0         0        0       0
##  5 ENSG0…        0        0         0        0       0
##  6 ENSG0…        0        0         0        0       0
##  7 ENSG0…        0        0         0        0       0
##  8 ENSG0…        3        0         1        2       1
##  9 ENSG0…        0        0         0        0       0
## 10 ENSG0…        0        0         0        0       0
## # … with 57,810 more rows, and 14 more variables:
## #   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## #   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## #   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## #   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## #   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Excluding columns matching a pattern
gse69360 %>%
  select(-matches("[AF]_"))
## # A tibble: 57,820 x 6
##    Geneid   Chr         Start      End        Strand  Length
##    <chr>    <chr>       <chr>      <chr>      <chr>    <dbl>
##  1 ENSG000… chr1;chr1;… 11869;125… 12227;127… +;+;+;+   1756
##  2 ENSG000… chr1;chr1;… 14363;149… 14829;150… -;-;-;…   2073
##  3 ENSG000… chr1;chr1;… 29554;302… 30039;306… +;+;+     1021
##  4 ENSG000… chr1;chr1;… 34554;352… 35174;354… -;-;-     1219
##  5 ENSG000… chr1;chr1   52473;548… 53312;549… +;+        947
##  6 ENSG000… chr1        62948      63887      +          940
##  7 ENSG000… chr1        69091      70008      +          918
##  8 ENSG000… chr1;chr1;… 89295;920… 91629;922… -;-;-;…   3569
##  9 ENSG000… chr1;chr1   89551;902… 90050;911… -;-       1319
## 10 ENSG000… chr1        131025     134836     +         3812
## # … with 57,810 more rows
# Select then Filter
gse69360 %>%
  select(Geneid, Chr, Length, matches("[AF]_")) %>%
  filter(grepl("chrY", Chr) | Length <= 100)
## # A tibble: 4,671 x 22
##    Geneid Chr   Length AA_Colon AA_Heart AA_Kidney AA_Liver
##    <chr>  <chr>  <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
##  1 ENSG0… chr1…     90        0        0         0        0
##  2 ENSG0… chr1…     57        0        0         0        0
##  3 ENSG0… chr1      95        5        0         2        0
##  4 ENSG0… chr1      90        0        0         0        0
##  5 ENSG0… chr1      83        7        0         1        2
##  6 ENSG0… chr1      96        0        0         0        0
##  7 ENSG0… chr1      70        0        0         0        0
##  8 ENSG0… chr1      73        0        0         0        0
##  9 ENSG0… chr1      89        0        0         0        0
## 10 ENSG0… chr1      70        0        0         0        0
## # … with 4,661 more rows, and 15 more variables:
## #   AA_Lung <dbl>, AA_Stomach <dbl>, AF_Colon <dbl>,
## #   AF_Stomach <dbl>, BA_Colon <dbl>, BA_Heart <dbl>,
## #   BA_Kidney <dbl>, BA_Liver <dbl>, BA_Lung <dbl>,
## #   BA_Stomach <dbl>, BF_Colon <dbl>, BF_Stomach <dbl>,
## #   OA_Stomach1 <dbl>, OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
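Besides matches(), select() accepts other helpers that avoid regular expressions when a literal prefix or suffix is enough. A small sketch on a made-up one-row table (column names borrowed from the dataset, values invented):

```r
library(dplyr)
library(tibble)

toy <- tibble(Geneid = "g1", Chr = "chr1", Length = 100,
              AA_Colon = 1, AF_Colon = 7, BF_Stomach = 3, OA_Stomach1 = 2)

# Regex: any character followed by "F_" (the fetal samples)
fetal  <- names(select(toy, Geneid, matches(".F_")))

# Literal prefix and suffix helpers
agil_a <- names(select(toy, starts_with("AA_")))
colon  <- names(select(toy, ends_with("_Colon")))

# Negation drops every column that matches
annot  <- names(select(toy, -matches("[AF]_")))
```

The pattern "[AF]_" works for dropping all count columns because every sample name has an A (adult) or F (fetus) right before the underscore.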

Mutate

  • mutate: Adds new variables; keeps existing variables
  • transmute: Adds new variables; drops existing variables
# Excluding columns matching a condition
gse69360 %>%
  select(-matches("[AF]_")) %>%
  head(., 10) %>% View()

# Storing gene location information in a separate data frame
gene_loc <- gse69360 %>%                                   # saving output to a variable
  select(-matches("[AF]_")) %>%                            # select columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) %>%      # remove version suffix (e.g., ".4")
  mutate(Chr = gsub(";.*$", "", Chr)) %>%                  # keep the first element for Chr
  mutate(Start = as.numeric(gsub(";.*$", "", Start))) %>%  # same for Start
  mutate(End = as.numeric(gsub(";.*$", "", End))) %>%      # same for End
  mutate(Strand = gsub(";.*$", "", Strand))                # same for Strand

# Check to see if you have what you expected!
gene_loc
## # A tibble: 57,820 x 6
##    Geneid          Chr    Start    End Strand Length
##    <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
##  1 ENSG00000223972 chr1   11869  12227 +        1756
##  2 ENSG00000227232 chr1   14363  14829 -        2073
##  3 ENSG00000243485 chr1   29554  30039 +        1021
##  4 ENSG00000237613 chr1   34554  35174 -        1219
##  5 ENSG00000268020 chr1   52473  53312 +         947
##  6 ENSG00000240361 chr1   62948  63887 +         940
##  7 ENSG00000186092 chr1   69091  70008 +         918
##  8 ENSG00000238009 chr1   89295  91629 -        3569
##  9 ENSG00000239945 chr1   89551  90050 -        1319
## 10 ENSG00000233750 chr1  131025 134836 +        3812
## # … with 57,810 more rows
View(head(gene_loc, 10))

# Creating new variables
gene_loc %>%
  mutate(kbStart = Start/1000,     # creates new variables/columns
         kbEnd = End/1000,
         kbLength = Length/1000)
## # A tibble: 57,820 x 9
##    Geneid Chr    Start    End Strand Length kbStart kbEnd
##    <chr>  <chr>  <dbl>  <dbl> <chr>   <dbl>   <dbl> <dbl>
##  1 ENSG0… chr1   11869  12227 +        1756    11.9  12.2
##  2 ENSG0… chr1   14363  14829 -        2073    14.4  14.8
##  3 ENSG0… chr1   29554  30039 +        1021    29.6  30.0
##  4 ENSG0… chr1   34554  35174 -        1219    34.6  35.2
##  5 ENSG0… chr1   52473  53312 +         947    52.5  53.3
##  6 ENSG0… chr1   62948  63887 +         940    62.9  63.9
##  7 ENSG0… chr1   69091  70008 +         918    69.1  70.0
##  8 ENSG0… chr1   89295  91629 -        3569    89.3  91.6
##  9 ENSG0… chr1   89551  90050 -        1319    89.6  90.0
## 10 ENSG0… chr1  131025 134836 +        3812   131.  135. 
## # … with 57,810 more rows, and 1 more variable:
## #   kbLength <dbl>
# Creating new variables & dropping old ones
gene_loc %>%
  transmute(kbStart = Start/1000,  # drops original columns
            kbEnd = End/1000,
            kbLength = Length/1000)
## # A tibble: 57,820 x 3
##    kbStart kbEnd kbLength
##      <dbl> <dbl>    <dbl>
##  1    11.9  12.2    1.76 
##  2    14.4  14.8    2.07 
##  3    29.6  30.0    1.02 
##  4    34.6  35.2    1.22 
##  5    52.5  53.3    0.947
##  6    62.9  63.9    0.94 
##  7    69.1  70.0    0.918
##  8    89.3  91.6    3.57 
##  9    89.6  90.0    1.32 
## 10   131.  135.     3.81 
## # … with 57,810 more rows
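The clean-up patterns used to build gene_loc can be verified on two rows copied from the head of the dataset. This is a minimal sketch; the `raw` and `clean` names are illustrative, and several columns are edited in a single mutate() call.

```r
library(dplyr)
library(tibble)

raw <- tibble(
  Geneid = c("ENSG00000223972.4", "ENSG00000240361.1"),
  Chr    = c("chr1;chr1;chr1;chr1", "chr1"),
  Start  = c("11869;12595;12975;13221", "62948")
)

clean <- raw %>%
  mutate(
    Geneid = sub("\\.[0-9]+$", "", Geneid),      # drop the trailing ".<digits>"
    Chr    = sub(";.*$", "", Chr),               # keep text before the first ";"
    Start  = as.numeric(sub(";.*$", "", Start))  # first start, as a number
  )
clean
```

Values without a semicolon, such as the second row, pass through unchanged because the pattern simply does not match.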

Distinct & Arrange

  • distinct: Pick unique entries
  • arrange: Arrange rows by variables
# Pick only the unique entries in a column
gene_loc %>%
  distinct(Chr)
## # A tibble: 25 x 1
##    Chr  
##    <chr>
##  1 chr1 
##  2 chr2 
##  3 chr3 
##  4 chr4 
##  5 chr5 
##  6 chr6 
##  7 chr7 
##  8 chr8 
##  9 chr9 
## 10 chr10
## # … with 15 more rows
gene_loc %>%
  distinct(Strand)
## # A tibble: 2 x 1
##   Strand
##   <chr> 
## 1 +     
## 2 -
# Pick unique combinations
gene_loc %>%
  distinct(Chr, Strand)
## # A tibble: 50 x 2
##    Chr   Strand
##    <chr> <chr> 
##  1 chr1  +     
##  2 chr1  -     
##  3 chr2  -     
##  4 chr2  +     
##  5 chr3  +     
##  6 chr3  -     
##  7 chr4  -     
##  8 chr4  +     
##  9 chr5  +     
## 10 chr5  -     
## # … with 40 more rows
# Then sort aka arrange your data
gene_loc %>%
  arrange(desc(Chr))    # sort in descending order
## # A tibble: 57,820 x 6
##    Geneid          Chr    Start    End Strand Length
##    <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
##  1 ENSGR0000228572 chrY  120410 120513 +         259
##  2 ENSGR0000182378 chrY  142989 143061 +        6191
##  3 ENSGR0000178605 chrY  170025 170137 -        2363
##  4 ENSGR0000226179 chrY  231725 231983 +         428
##  5 ENSGR0000167393 chrY  244698 249631 -        9015
##  6 ENSGR0000266731 chrY  375316 375416 -         101
##  7 ENSGR0000234958 chrY  434510 434837 -         328
##  8 ENSGR0000229232 chrY  455971 456087 -         117
##  9 ENSGR0000185960 chrY  535079 535337 +        4384
## 10 ENSGR0000237531 chrY  900956 901162 +         588
## # … with 57,810 more rows
gene_loc %>%
  arrange(Chr, Length)  # sort by Chr, then Length
## # A tibble: 57,820 x 6
##    Geneid          Chr       Start       End Strand Length
##    <chr>           <chr>     <dbl>     <dbl> <chr>   <dbl>
##  1 ENSG00000268141 chr1  154585067 154585080 +          21
##  2 ENSG00000224335 chr1  147706573 147706607 +          35
##  3 ENSG00000263526 chr1   95211416  95211456 -          41
##  4 ENSG00000230610 chr1   38292232  38292262 -          45
##  5 ENSG00000252056 chr1  236026688 236026737 -          50
##  6 ENSG00000238705 chr1   26881033  26881084 +          52
##  7 ENSG00000264650 chr1   45499641  45499692 -          52
##  8 ENSG00000265769 chr1   52366592  52366643 -          52
##  9 ENSG00000251795 chr1   93303575  93303627 +          53
## 10 ENSG00000238310 chr1  109750416 109750470 -          55
## # … with 57,810 more rows
# arrange(Chr, -Length) # to reverse sort by 'numeric' Length
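Because Chr is a character column, arrange(Chr) sorts it alphabetically, so chr10 comes before chr2. One way to get chromosome order instead is to convert Chr to a factor with explicit levels. The sketch below uses a made-up four-row table, and the level list is an assumption about which chromosome names appear in the data.

```r
library(dplyr)
library(tibble)

toy <- tibble(Chr    = c("chr10", "chr2", "chrX", "chr1"),
              Length = c(50, 70, 20, 90))

# Character sort is alphabetical: chr1, chr10, chr2, chrX
alpha_order <- arrange(toy, Chr)$Chr

# Factor levels define the order that arrange() follows
chr_levels <- paste0("chr", c(1:22, "X", "Y", "M"))
natural_order <- toy %>%
  mutate(Chr = factor(Chr, levels = chr_levels)) %>%
  arrange(Chr) %>%
  pull(Chr) %>%
  as.character()
```

Any value not listed in `chr_levels` becomes NA in the factor, so it is worth checking distinct(Chr) first.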

Group_by & Summarize

  • summarize: Reduces multiple values down to a single value
  • group_by: Combine entries by one or more variables
# Combine by a variable, then calculate summary statistics for each group
gene_loc %>%
  group_by(Chr) %>%                         # combine rows by Chr
  summarize(numGenes = n(),                 # then summarise, number of genes/Chr
            startoffirstGene = min(Start))  # min to get the first Start location
## # A tibble: 25 x 3
##    Chr   numGenes startoffirstGene
##    <chr>    <int>            <dbl>
##  1 chr1      5363            11869
##  2 chr10     2260            90652
##  3 chr11     3208            75780
##  4 chr12     2818            67607
##  5 chr13     1217         19041312
##  6 chr14     2244         19110203
##  7 chr15     2080         20083769
##  8 chr16     2343            61553
##  9 chr17     2903             4961
## 10 chr18     1127            11103
## # … with 15 more rows
# Example to show you can use all math/stat functions to summarize data groups
gene_loc %>%
  arrange(Length) %>%
  group_by(Chr, Strand) %>%
  summarize(numGenes = n(),
            smallestGene = first(Geneid),
            minLength = min(Length),
            firstqLength = quantile(Length, 0.25),
            medianLength = median(Length),
            iqrLength = IQR(Length),
            thirdqLength = quantile(Length, 0.75),
            maxLength = max(Length),
            longestGene = last(Geneid))
## # A tibble: 50 x 11
## # Groups:   Chr [25]
##    Chr   Strand numGenes smallestGene minLength firstqLength
##    <chr> <chr>     <int> <chr>            <dbl>        <dbl>
##  1 chr1  -          2651 ENSG0000026…        41         354.
##  2 chr1  +          2712 ENSG0000026…        21         394.
##  3 chr10 -          1109 ENSG0000026…        52         322 
##  4 chr10 +          1151 ENSG0000027…        27         356 
##  5 chr11 -          1605 ENSG0000026…        30         477 
##  6 chr11 +          1603 ENSG0000025…        37         476 
##  7 chr12 -          1415 ENSG0000025…        28         375 
##  8 chr12 +          1403 ENSG0000026…        30         402 
##  9 chr13 -           631 ENSG0000021…        48         297 
## 10 chr13 +           586 ENSG0000026…        57         371.
## # … with 40 more rows, and 5 more variables:
## #   medianLength <dbl>, iqrLength <dbl>,
## #   thirdqLength <dbl>, maxLength <dbl>, longestGene <chr>
# Declaring chromosomes as factors w/ an intrinsic order
# gene_loc$Chr <- factor(gene_loc$Chr,
#                        levels = paste("chr",
#                                       c((1:22), "X", "Y", "M"),
#                                       sep=""))

# Saving your data locally
gene_loc %>%
  write_tsv(here("data/GSE69360.gene-locations.txt"))
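Grouped summaries are easiest to sanity-check on a table where the answer can be counted by hand. A sketch with made-up data:

```r
library(dplyr)
library(tibble)

toy <- tibble(
  Chr    = c("chr1", "chr1", "chr1", "chr2"),
  Strand = c("+", "-", "+", "+"),
  Length = c(100, 300, 500, 40)
)

# One row per chromosome
per_chr <- toy %>%
  group_by(Chr) %>%
  summarize(numGenes = n(), medianLength = median(Length))

# count() is shorthand for group_by() + summarize(n = n())
per_chr_strand <- count(toy, Chr, Strand)
```

Here chr1 has three genes with a median length of 300, and the per-strand counts add back up to the four input rows.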

More data wrangling

Let’s combine everything from above to tidy the full GSE69360 dataset.

Dataset details:

  • AA: Agilent Adult; AF: Agilent Fetus
  • BA: BioChain Adult; BF: BioChain Fetus
  • OA: OriGene Adult
  • Tissues: Colon, Heart, Kidney, Liver, Lung, Stomach
# Extracting just the expression values & cleaning it up
View(head(gse69360, 50))

gene_counts <- gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length) %>%          # another way to select just the expression data
  rename(OA_Stomach = OA_Stomach1) %>%                      # rename one column
  mutate(OA_Stomach2 = NULL, OA_Stomach3 = NULL) %>%        # remove a couple of columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid))           # remove the version suffix from Geneid

logcpm <- gene_counts %>%
  select(-Geneid) %>%
  mutate_all(function(x) { log2((x*(1e+6)/sum(x)) + 1) } )  # convert counts in each sample to log2(counts-per-million + 1)

summary(logcpm)
##     AA_Colon           AA_Heart         AA_Kidney      
##  Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.06883   Median : 0.0000   Median : 0.0000  
##  Mean   : 1.10187   Mean   : 0.8103   Mean   : 0.8785  
##  3rd Qu.: 1.33045   3rd Qu.: 0.5151   3rd Qu.: 0.7181  
##  Max.   :18.41311   Max.   :18.7478   Max.   :18.8061  
##     AA_Liver           AA_Lung         AA_Stomach      
##  Min.   : 0.00000   Min.   : 0.000   Min.   : 0.00000  
##  1st Qu.: 0.00000   1st Qu.: 0.000   1st Qu.: 0.00000  
##  Median : 0.05563   Median : 0.000   Median : 0.06964  
##  Mean   : 0.98788   Mean   : 1.049   Mean   : 0.93120  
##  3rd Qu.: 0.95867   3rd Qu.: 1.123   3rd Qu.: 0.91851  
##  Max.   :17.85244   Max.   :17.722   Max.   :17.85986  
##     AF_Colon         AF_Stomach         BA_Colon       
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median : 0.1995   Median : 0.0000   Median : 0.08114  
##  Mean   : 1.2903   Mean   : 0.7748   Mean   : 0.98586  
##  3rd Qu.: 1.8951   3rd Qu.: 0.8105   3rd Qu.: 1.10906  
##  Max.   :17.4620   Max.   :18.3586   Max.   :18.75742  
##     BA_Heart          BA_Kidney           BA_Liver      
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0000  
##  Median : 0.04537   Median : 0.05856   Median : 0.0000  
##  Mean   : 0.80833   Mean   : 0.96659   Mean   : 0.8194  
##  3rd Qu.: 0.68444   3rd Qu.: 1.11092   3rd Qu.: 0.5974  
##  Max.   :18.92846   Max.   :18.87617   Max.   :17.7612  
##     BA_Lung          BA_Stomach         BF_Colon      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.1643   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.4456   Median : 0.0000   Median : 0.1793  
##  Mean   : 1.3873   Mean   : 0.7528   Mean   : 1.3716  
##  3rd Qu.: 1.9392   3rd Qu.: 0.6261   3rd Qu.: 1.9775  
##  Max.   :17.5495   Max.   :18.3249   Max.   :15.8531  
##    BF_Stomach        OA_Stomach     
##  Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.3085   Median : 0.1003  
##  Mean   : 1.5164   Mean   : 1.0800  
##  3rd Qu.: 2.4221   3rd Qu.: 1.3692  
##  Max.   :15.7256   Max.   :17.9600
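
To see what the log-CPM step above does, here is a minimal sketch on a made-up count table. The `toy_counts` data and the sample names `S1` and `S2` are hypothetical and are not part of GSE69360. It uses `across()`, the newer dplyr spelling of `mutate_all()`, so it assumes dplyr 1.0.0 or later.

```r
library(tibble)
library(dplyr)

# Hypothetical toy counts: 3 genes (rows) x 2 samples (columns)
toy_counts <- tibble(
  S1 = c(10, 0, 90),   # library size = 100
  S2 = c(5, 5, 0)      # library size = 10
)

# Scale each sample to counts-per-million, add 1, then take log2
toy_logcpm <- toy_counts %>%
  mutate(across(everything(), ~ log2(.x * 1e6 / sum(.x) + 1)))

toy_logcpm
```

Dividing by `sum(x)` makes samples with different sequencing depths comparable, and the `+ 1` keeps genes with zero counts at exactly 0 after the log.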
gene_logcpm <- gene_counts %>%                              # a new dataframe with logcpm values
  select(Geneid) %>%
  bind_cols(logcpm) %>%                                     # bind the logcpm matrix to the geneids
  gather(-Geneid, key = "Sample", value = "Logcpm") %>%     # convert to tidy data
  separate(Sample,                                          # cleanup complex variables
           into = c("Source", "Stage", "Tissue"),
           sep = c(1,2),
           remove = F) %>%                                  # keep original variable
  mutate(Tissue = gsub("^_", "", Tissue),
         Stage = ifelse(Stage == "A", "Adult", "Fetus"))

View(head(gene_logcpm, 50))

# Plotting the distribution of gene-expression in each sample
gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()
## Warning: Removed 259347 rows containing non-finite values
## (stat_boxplot).


Part 4: Visualizing tidy data w/ ggplot

Basics of ggplot2

Creating a plot w/ Grammar of Graphics

  1. Recap and continuation of dplyr
  2. Basics of plotting data with ggplot2: data, aes, geom
  3. Customization: Colors, labels, and legends

Barplots & Histograms

  • ggplot, factor, aes
  • geom_bar, geom_histogram
  • facet_wrap
  • scale_x_log10, labs, coord_flip, theme, theme_minimal
gene_loc %>%                                                      # data
  ggplot(aes(x = Chr)) +                                          # aesthetics: what to plot?
  geom_bar()                                                      # geometry: how to plot?

gene_loc$Chr <- factor(gene_loc$Chr,
                       levels = paste("chr",
                                      c((1:22), "X", "Y", "M"),
                                      sep=""))

plot_chr_numgenes <- gene_loc %>%
  ggplot(aes(x = Chr)) +
  geom_bar()
plot_chr_numgenes
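
Why convert Chr to a factor before plotting? A quick base R sketch, using a handful of made-up chromosome labels, shows how the default ordering differs from the one set with `levels`:

```r
# Hypothetical chromosome labels
chr <- c("chr10", "chr2", "chrX", "chr1", "chr2")

# Default: levels are sorted as text, so chr10 comes before chr2
default_levels <- levels(factor(chr))
default_levels

# Explicit levels: chromosomes in their natural order
chr_levels <- paste0("chr", c(1:22, "X", "Y", "M"))
chr_ordered <- factor(chr, levels = chr_levels)

# Keep only the levels that occur in the toy data, to compare the order
ordered_levels <- levels(droplevels(chr_ordered))
ordered_levels
```

ggplot2 draws discrete axes in level order, so setting the levels is what puts the bars in chromosome order.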

plot_chr_numgenes +
  coord_flip() +
  theme_minimal()

plot_chr_numgenes +
  labs(title = "No. genes per chromosome",
       x = "Chromosome",
       y = "No. of genes") +
  theme_minimal() +
  coord_flip()

gene_loc %>%
  ggplot(aes(x = Length)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

plot_chr_genelength <- gene_loc %>%
  ggplot(aes(x = Length, fill = Chr)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal() +
  facet_wrap(~Chr, scales = "free_y")
plot_chr_genelength
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

plot_chr_genelength +
  theme(legend.position = "none") +
  labs(x = "Gene length (log-scale)",
       y = "No. of genes")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
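
The `stat_bin()` message above is just a reminder that the default of 30 bins was used. A minimal sketch of setting the bin width yourself, on simulated gene lengths (the `toy` data frame is made up, and the width of 0.1 is an arbitrary choice):

```r
library(ggplot2)

# Hypothetical gene lengths, roughly log-normal
set.seed(1)
toy <- data.frame(Length = 10^rnorm(500, mean = 3, sd = 0.5))

p <- ggplot(toy, aes(x = Length)) +
  geom_histogram(binwidth = 0.1, color = "white") +  # width in log10 units, see note below
  scale_x_log10() +
  theme_minimal()

p
```

Because the axis is on a log10 scale, `binwidth` is measured in log10 units here, so 0.1 means ten bins per factor of 10 in length. Alternatively, `bins = 50` sets the number of bins directly.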

Scatter plots

  • geom_point
  • geom_abline, geom_vline, geom_hline
  • geom_smooth, geom_text_repel
gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point()

plot_strend_length <- gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point(alpha = 0.1, size = 0.5, color = "grey", fill = "grey")
plot_strend_length

plot_strend_length <- plot_strend_length +
  scale_x_log10("End-Start") +
  scale_y_log10("Gene length") +
  theme_minimal()
plot_strend_length
## Warning: Transformation introduced infinite values in
## continuous x-axis

plot_strend_length +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  geom_hline(yintercept = 500, color = "blue") +
  geom_vline(xintercept = 1000, color = "orange")
## Warning: Transformation introduced infinite values in
## continuous x-axis
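
The "infinite values" warning above most likely comes from rows where End and Start are equal: End-Start is then 0, and log10(0) is -Inf. A small sketch with made-up coordinates (the `toy` data frame is hypothetical) shows the issue and one way to drop such rows before plotting:

```r
library(ggplot2)

# Hypothetical toy genes; the second one has End == Start
toy <- data.frame(Start  = c(100, 500, 1000),
                  End    = c(600, 500, 4000),
                  Length = c(400, 1, 2500))

log_span <- log10(toy$End - toy$Start)
log_span                       # the second value is -Inf

# Keep only the rows that can be shown on a log axis
toy_pos <- subset(toy, End > Start)

p <- ggplot(toy_pos, aes(x = End - Start, y = Length)) +
  geom_point() +
  scale_x_log10("End-Start") +
  scale_y_log10("Gene length") +
  theme_minimal()
p
```

Whether to drop those rows or just live with the warning depends on what you want to show; ggplot2 leaves the affected points out of the plot either way.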

gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n())
## # A tibble: 25 x 3
##    Chr   meanLength numGenes
##    <fct>      <dbl>    <int>
##  1 chr1       2258.     5363
##  2 chr2       2304.     4047
##  3 chr3       2382.     3101
##  4 chr4       2109.     2563
##  5 chr5       2188.     2859
##  6 chr6       2124.     2905
##  7 chr7       2164.     2876
##  8 chr8       2036.     2386
##  9 chr9       2123.     2323
## 10 chr10      2160.     2260
## # … with 15 more rows
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point()

library(ggrepel)
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point() +
  geom_smooth(color = "lightblue", alpha = 0.1) +
  labs(x = "No. of genes", y = "Mean gene length") +
  geom_text_repel(aes(label = Chr), color="red", segment.color="grey80") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Boxplots & Violin plots

  • geom_boxplot, geom_violin
  • scale_y_continuous
gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal()

gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_violin() +
  scale_y_continuous(limits = c(0, 0.5)) +
  coord_flip() +
  theme_minimal()
## Warning: Removed 320317 rows containing non-finite values
## (stat_ydensity).

# Plotting the distribution of gene-expression in each sample
plot_sample_bxp <- gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()
plot_sample_bxp
## Warning: Removed 259347 rows containing non-finite values
## (stat_boxplot).
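
A note on the "Removed ... rows" warnings above: limits set through `scale_y_continuous()` drop the out-of-range values before the boxplot statistics are computed. If you only want to zoom in, `coord_cartesian()` is an alternative that keeps all the data. A minimal sketch on made-up values (the `toy` data frame is hypothetical):

```r
library(ggplot2)

# Hypothetical expression values for a single sample
toy <- data.frame(Sample = "S1", Logcpm = c(0.1, 0.2, 0.3, 5, 10))

# Scale limits: values outside [0, 1] are removed, then the boxplot is computed
p_scale <- ggplot(toy, aes(x = Sample, y = Logcpm)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 1))

# Coordinate limits: the boxplot is computed on all values, then the view is zoomed
p_zoom <- ggplot(toy, aes(x = Sample, y = Logcpm)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 1))

# Compare the medians that each plot would draw
median_scale <- suppressWarnings(layer_data(p_scale)$middle)
median_zoom  <- layer_data(p_zoom)$middle
c(scale_limits = median_scale, coord_limits = median_zoom)
```

The two medians differ because the first plot only sees the three values inside the limits. Neither choice is wrong, but it is worth knowing which one you are looking at.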

# Plotting scatterplot of 2 sets of samples
plot_ffcolon_scatter <- gene_logcpm %>%
  filter(Sample == "AF_Colon" | Sample == "BF_Colon") %>%
  select(Geneid, Sample, Logcpm) %>%
  spread(key = Sample, value = Logcpm) %>%
  ggplot(aes(x = AF_Colon, y = BF_Colon)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method=lm) +
  theme_minimal()
plot_ffcolon_scatter

Some data sleuthing

# Finding genes with high variance across samples
num_totgenes <- gene_logcpm %>%
  distinct(Geneid) %>%
  nrow()

highvar_genes <- gene_logcpm %>%
  group_by(Geneid) %>%
  summarize(iqr = IQR(Logcpm)) %>%
  top_n((ceiling(num_totgenes*0.05)), iqr) %>%
  pull(Geneid)

length(highvar_genes)
## [1] 2891
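
The same IQR-based selection, sketched on a tiny made-up table so the numbers are easy to follow. The `toy` tibble and the 50% cutoff are hypothetical; the workshop code above keeps the top 5%.

```r
library(tibble)
library(dplyr)

# Hypothetical tidy expression table: 4 genes x 4 samples
toy <- tibble(
  Geneid = rep(c("g1", "g2", "g3", "g4"), each = 4),
  Logcpm = c(1, 1, 1, 1,      # g1: flat across samples
             0, 2, 4, 6,      # g2: moderately variable
             0, 0, 10, 10,    # g3: highly variable
             3, 3, 4, 4)      # g4: slightly variable
)

num_toygenes <- toy %>% distinct(Geneid) %>% nrow()

toy_iqr <- toy %>%
  group_by(Geneid) %>%
  summarize(iqr = IQR(Logcpm))

# Keep the top 50% of genes by IQR
toy_highvar <- toy_iqr %>%
  top_n(ceiling(num_toygenes * 0.5), iqr) %>%
  pull(Geneid)

toy_highvar
```

IQR is a handy measure of spread here because it is less sensitive to a single extreme sample than the variance.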
# Plotting expression of high-var Y chr genes across samples
chry_highvar_genes <- gene_loc %>%
  filter(Chr == "chrY" & Geneid %in% highvar_genes) %>%
  pull(Geneid)

gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes)
## # A tibble: 238 x 6
##    Geneid          Sample   Source Stage Tissue Logcpm
##    <chr>           <chr>    <chr>  <chr> <chr>   <dbl>
##  1 ENSG00000129824 AA_Colon A      Adult Colon    6.12
##  2 ENSG00000067646 AA_Colon A      Adult Colon    4.89
##  3 ENSG00000231535 AA_Colon A      Adult Colon    3.12
##  4 ENSG00000099725 AA_Colon A      Adult Colon    3.55
##  5 ENSG00000233864 AA_Colon A      Adult Colon    3.73
##  6 ENSG00000114374 AA_Colon A      Adult Colon    5.67
##  7 ENSG00000067048 AA_Colon A      Adult Colon    5.62
##  8 ENSG00000183878 AA_Colon A      Adult Colon    6.05
##  9 ENSG00000165246 AA_Colon A      Adult Colon    2.31
## 10 ENSG00000176728 AA_Colon A      Adult Colon    2.52
## # … with 228 more rows
plot_chry_highvar_boxplot <- gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes) %>%
  ggplot(aes(x = reorder(Sample, Logcpm, FUN = median),
             y = Logcpm,
             color = Sample)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")
plot_chry_highvar_boxplot


Part 5: Export & Wrap-up w/ RMarkdown

Saving your plots

ggsave

Save a ggplot (or other grid object) with sensible defaults

library(tidyverse)
library(here)
# Set your file name
plot1 <- "my_plot1.tiff"

# Set the directory to save into (absolute/relative path)
my_full_path <- here("data")

# To save the plot as a TIFF image ...
ggsave(filename=plot1,
       plot=static_plot,      # the ggplot object you want to save
       device="tiff",
       path=my_full_path,
       dpi=600)
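
If you want to try `ggsave` without touching your project folder, here is a self-contained sketch that saves a made-up plot to a temporary directory. The plot, the file name, and the choice of PDF output are all just for illustration.

```r
library(ggplot2)

# Hypothetical toy plot
toy_plot <- ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()

out_dir  <- tempdir()          # a throwaway directory for this R session
out_file <- "toy_plot.pdf"

ggsave(filename = out_file,
       plot = toy_plot,
       device = "pdf",
       path = out_dir,
       width = 5, height = 4, units = "in")   # set the size explicitly

file.exists(file.path(out_dir, out_file))
```

Setting `width` and `height` explicitly makes the saved figure reproducible; otherwise `ggsave` uses the size of the current graphics device.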

Saving your data files

write_delim

Write a data frame to a delimited file

library(tidyverse)
library(here)
# Set your file name
filename <- "my_new_data.txt"

# Set the directory to save into (absolute/relative path)
my_full_path <- here("data")

# Build the full path to the output file
my_path <- file.path(my_full_path, filename)

# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
          path=my_path, # Absolute path recommended.
          # However, you can directly use 'filename' here
          # if you are saving the file in the same directory
          # as your code.
          # Note: recent versions of readr call this argument 'file'.
          col_names=TRUE) # if you want the column names to be
# saved in the first row, recommended

# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
          path=my_path,
          col_names=TRUE)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
            path=my_path,
            col_names=TRUE,
            delim="---")
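
A quick way to convince yourself that a file was written correctly is to read it back in. Here is a minimal round-trip sketch with a made-up table and a temporary file (`toy` and `tmp` are hypothetical names). The file path is passed by position, which sidesteps the `path` vs `file` argument name across readr versions.

```r
library(tibble)
library(readr)

# Hypothetical toy data
toy <- tibble(Geneid = c("g1", "g2", "g3"),
              Logcpm = c(1.5, 0, 3.25))

tmp <- tempfile(fileext = ".tsv")   # throwaway file for this R session

write_tsv(toy, tmp)                 # column names are written by default

toy_back <- read_tsv(tmp,
                     col_types = cols(Geneid = col_character(),
                                      Logcpm = col_double()))
toy_back
```

Spelling out `col_types` on the way back in keeps the column types from being guessed differently than you expect.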

What you learnt today!

Option                                  Description

Part 1                                  Getting Started
install.packages                        Download and install packages from CRAN-like repositories or from local files
library                                 Load and attach add-on packages
package::function                       Run a function w/o loading the package
Import                                  tidyverse > readr & readxl
read_delim                              Read a delimited file (incl. csv, tsv) into a tibble
read_csv, read_tsv                      Special cases of the general read_delim()
read_excel                              Read xls and xlsx files
here                                    Part of the here package. Set paths relative to your current project directory
Data snapshot
str                                     Compactly display the structure of an arbitrary R object
head                                    Return the first or last part of an object
glimpse                                 Get a glimpse of your data
View                                    Invoke a data viewer
kable                                   Create tables in LaTeX, HTML, Markdown and reStructuredText
paged_table                             Create a table in HTML with support for paging rows and columns

Part 2                                  tidyverse > tidyr
gather                                  Gather columns into key-value pairs (COLS -> ROWS)
spread                                  Spread a key-value pair across multiple columns (ROWS -> COLS)
separate                                Separate one column into multiple columns
unite                                   Unite multiple columns into one

Part 3                                  tidyverse > dplyr
filter                                  Return rows with matching conditions
select                                  Select/rename variables by name
mutate                                  Add new variables
transmute                               Add new variables & drop existing variables
distinct                                Get unique entries
arrange                                 Arrange rows by variables
summarise                               Reduce multiple values down to a single value
group_by                                Group by one or more variables
join                                    Join two tbls together: left_join, right_join, inner_join
bind                                    Efficiently bind multiple data frames by row and column: bind_rows, bind_cols
setops                                  Set operations: intersect, union, setdiff, setequal

Part 4                                  tidyverse > ggplot2
ggplot                                  Create a new ggplot
Aesthetics
aes                                     Specify details on what to plot
factor                                  Specify a variable to be ordered & categorical
facet_wrap                              Split into multiple sub-plots based on a categorical variable
Geometries
geom_bar                                Create a barplot
geom_histogram                          Create a histogram
geom_point                              Create a scatter plot
geom_boxplot                            Create a boxplot
geom_violin                             Create a violin plot
geom_abline, geom_hline, geom_vline     Add slanted, horizontal, or vertical lines
geom_smooth                             Add a smooth curve that fits the data points
Plot customization
theme, theme_minimal, theme_bw          Adjust themes to minimal, black/white, etc.
scale_x_log10, scale_y_log10            Change the x/y axis to log scale
scale_y_continuous                      Customize a continuous y axis, e.g. set its limits
coord_flip                              Flip x & y coordinates
labs                                    Specify axes labels and plot titles

Part 5                                  Export & Wrap-up
Export                                  tidyverse > ggplot2 & readr
ggsave                                  Save a ggplot (or other grid object) with sensible defaults
write_delim                             Write a data frame to a delimited file
write_tsv                               write_delim customized for tab-separated values
write_csv                               write_delim customized for comma-separated values

Credits

Arjun Krishnan and I co-developed the content for this workshop.

Additional resources

Some awesome open-source books